Appears in ECML-98 as a Research Note. A longer version is available as ECE TR 98-3, Purdue University.

Pruning Decision Trees with Misclassification Costs
Abstract
We describe an experimental study of pruning methods for decision tree classifiers when the goal is minimizing loss rather than error. In addition to two common methods for error minimization, CART's cost-complexity pruning and C4.5's error-based pruning, we study the extension of cost-complexity pruning to loss and one pruning variant based on the Laplace correction. We perform an empirical comparison of these methods and evaluate them with respect to loss. We found that applying the Laplace correction to estimate the probability distributions at the leaves was beneficial to all pruning methods. Unlike in error minimization, and somewhat surprisingly, performing no pruning led to results that were on par with other methods in terms of the evaluation criteria. The main advantage of pruning was in the reduction of the decision tree size, sometimes by a factor of ten. While no method dominated others on all datasets, even for the same domain different pruning mechanisms are better for different loss matrices.

Decision trees are a widely used symbolic modeling technique for classification tasks in machine learning. The most common approach to constructing decision tree classifiers is to grow a full tree and prune it back. Pruning is desirable because the tree that is grown may overfit the data by inferring more structure than is justified by the training set. Specifically, if there are no conflicting instances, the training set error of a fully built tree is zero, while the true error is likely to be larger. To combat this overfitting problem, the tree is pruned back with the goal of identifying the tree with the lowest error rate on previously unobserved instances, breaking ties in favor of smaller trees (Breiman, Friedman, Olshen & Stone 1984; Quinlan 1993). Several pruning methods have been introduced in the literature, including cost-complexity pruning, reduced error pruning, pessimistic pruning, error-based pruning, penalty pruning, and MDL pruning. Historically,
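To make the loss-based setting concrete, the sketch below shows how Laplace-corrected leaf probabilities interact with a loss matrix when choosing a leaf's label. This is a minimal illustration, not the paper's implementation; the function names, the loss-matrix layout (loss[i][j] as the cost of predicting class j when the true class is i), and the example counts are assumptions made here.

```python
# Minimal sketch (not the paper's code): Laplace-corrected leaf
# probabilities and loss-minimizing label selection at a leaf.

def laplace_probs(class_counts):
    """Laplace-corrected estimate (n_i + 1) / (N + k) for each class i."""
    n = sum(class_counts)
    k = len(class_counts)
    return [(c + 1) / (n + k) for c in class_counts]

def best_label(class_counts, loss):
    """Pick the label minimizing expected loss at a leaf.

    loss[i][j] is assumed to be the cost of predicting class j
    when the true class is i.
    """
    probs = laplace_probs(class_counts)
    k = len(class_counts)
    expected = [sum(probs[i] * loss[i][j] for i in range(k))
                for j in range(k)]
    return min(range(k), key=expected.__getitem__)

# Illustrative example: a leaf holding 3 instances of class 0 and 1 of
# class 1. Under 0/1 loss the majority class wins; with a 10x penalty
# for misclassifying class 1, the expected loss favors predicting 1.
counts = [3, 1]
zero_one = [[0, 1], [1, 0]]
asymmetric = [[0, 1], [10, 0]]
print(best_label(counts, zero_one))    # -> 0
print(best_label(counts, asymmetric))  # -> 1
```

The smoothing matters here: with raw frequency estimates a pure leaf assigns probability zero to the minority class, so no loss matrix, however asymmetric, can change its label; the Laplace correction keeps every class's estimated probability positive.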
Similar Papers
Evaluation of liquefaction potential based on CPT results using C4.5 decision tree
The prediction of liquefaction potential of soil due to an earthquake is an essential task in Civil Engineering. The decision tree is a tree structure consisting of internal and terminal nodes which process the data to ultimately yield a classification. C4.5 is a known algorithm widely used to design decision trees. In this algorithm, a pruning process is carried out to solve the problem of the...
The Effect of Fruit Trees Pruning Waste Biochar on some Soil Biological Properties under Rhizobox Conditions
The pyrolysis of fruit tree pruning waste to convert it to biochar, combined with microbial inoculation, is a strategy for improving the biological properties of calcareous soils. In order to investigate the effect of biochar on some soil biological properties in the presence of microorganisms, a factorial experiment was carried out in a completely randomized design in a rhizobox under greenhous...
Simplifying Decision Trees by Pruning and Grafting: New Results (Extended Abstract)
This paper presents some empirical results on simplification methods of decision trees induced from data. We observe that those methods exploiting an independent pruning set do not perform uniformly better than the others. Furthermore, a clear definition of bias towards overpruning and underpruning is exploited in order to interpret empirical data concerning the size of the simplified trees.
Building Simple Models: A Case Study with Decision Trees
1 Introduction Many induction algorithms construct models with unnecessary structure. These models contain components that do not improve accuracy, and that only reflect random variation in a single data sample. Such models are less efficient to store and use than their correctly-sized counterparts. Using these models requires the collection of unnecessary data. Portions of these...
On the Boosting Pruning Problem
Boosting is a powerful method for improving the predictive accuracy of classifiers. The AdaBoost algorithm of Freund and Schapire has been successfully applied to many domains [2, 10, 12] and the combination of AdaBoost with the C4.5 decision tree algorithm has been called the best off-the-shelf learning algorithm in practice. Unfortunately, in some applications, the number of decision trees req...